Conversation
When accessibleOnly=true with a search query, the old path fetched all installation repos (up to 5,000) on every debounced keystroke, then filtered with a Python list comprehension. Replace this with a cached set of accessible repo IDs (5-min Redis TTL) combined with the GitHub Search API, reducing each typed query from O(pages) API calls to a single search call plus a Redis lookup. Refs VDY-68
…h API Switch from Search API + cached ID set to caching the full repo list and filtering locally. This avoids the Search API's shared 30 req/min rate limit and uses sentry.cache.default_cache (Redis-backed) instead of django.core.cache (DummyCache in Sentry). Refs VDY-68
Keep the cached repo list unfiltered so the cache is a faithful snapshot of the GitHub API response. Apply the archived filter in get_repositories alongside the other transforms. Also let the accessible_only path handle both with and without a query. Refs VDY-68
The Search API does not return archived repos, so the archived filter should only apply to the /installation/repositories paths.
Move no-query path first since accessible_only is only useful with a query (repeated keystrokes). Combine archived and query filters into a single pass through to_repo_info.
Strip raw GitHub repo dicts down to the 5 fields used by get_repositories before storing in the cache. Reduces per-integration cache size from ~3KB per repo to ~100 bytes.
getsentry configures CACHES with memcached in production, so django.core.cache.cache works and matches the pattern used by the rest of the integrations codebase.
| def get_accessible_repos_cached(self, ttl: int = 300) -> list[CachedRepo]: | ||
| """ | ||
| Return all repos accessible to this installation. | ||
| Cached in Django cache for ``ttl`` seconds so that debounced | ||
| search keystrokes don't re-fetch all pages from GitHub. | ||
|
|
||
| Only the fields used by get_repositories() are stored to keep | ||
| the cache payload small. | ||
| """ | ||
| cache_key = f"github:accessible_repos:{self.integration.id}" | ||
| cached = cache.get(cache_key) | ||
| if cached is not None: | ||
| return cached | ||
|
|
||
| all_repos = self.get_repos() | ||
| repos: list[CachedRepo] = [ | ||
| { | ||
| "id": r["id"], | ||
| "name": r["name"], | ||
| "full_name": r["full_name"], | ||
| "default_branch": r.get("default_branch"), | ||
| "archived": r.get("archived"), | ||
| } | ||
| for r in all_repos | ||
| ] | ||
| cache.set(cache_key, repos, ttl) | ||
| return repos |
There was a problem hiding this comment.
This isn't really get_accessible_repos_cached, it's get_repos_cached
| r | ||
| for r in all_repos_cached | ||
| if not r.get("archived") and query_lower in r["full_name"].lower() | ||
| ) |
There was a problem hiding this comment.
I think that rather than making the cache use conditional on accessible_only, we should have use_cache=False. We should have an assert like assert not use_cache or not query since the cache doesn't work with the query
There was a problem hiding this comment.
A little confused by this comment because the cache is set up to work with the accessible_only variant of querying.
accessible_only is a means to query accessible repos, default query behavior was exposing repos that sentry does not necessarily have access to (public repos not selected during installation configuration).
the cache in its current form its only enabled for the accessible_only path because I didn't want to alter behavior for existing consumers.
that said cache could instead be opt in and applied to either the not query path or the accessible_only path, is that along the lines of your thinking?
Further changes here are introduced with pagination support #112591
There was a problem hiding this comment.
My main point is that it doesn't really make sense to make the cache specific to accessible only. We're caching all the repos, and then we're filtering to accessible only in python. So if we add use_cache=False, then in your usages you pass use_cache=True, is_accessible=True it's more general and allows us to use the cache in other places later, if we want to. In general, I feel like Claude tends to do things in an overly specific way so I just wanted to guide us away from that here.
There was a problem hiding this comment.
Okay, I will alter things so that cache can be applied to either the not query or accessible_only paths.
Filtering of accessible is not handled via python filtering (its a different github api path), the "archived" filter is kind of a red herring there.
accessible vs non accessible:
accessible repo fetch client.get_repos(
accessible agnostic repo search client.search_repositories(
There was a problem hiding this comment.
Oh sorry, I misread here... although I'm confused about the split that exists in the current code because it looks like it fetches all repos from github and then filters out archived repos when it's accessible.
It seems like if there's no query, we should always be using self.get_client().get_repos? I'm not totally sure why we'd use the search api when we're fetching everything
There was a problem hiding this comment.
Ok, discussed on slack and I understand what's going on a bit better here now.
Yeah, I think it still makes sense to have use_cache be explicit. Then whenever is_accessible is true, we can optionally use the cache or not based on it, and just have assert not use_cache or not is_accessible to make sure that we don't confuse folks who use is_accessible.
Probably it'd be nice to get rid of is_accessible completely but idk if we're relying on the behaviour implicitly somewhere.
| # accessible_only: fetch and filter accessible repos (cached) | ||
| # avoids re-fetching all pages on every debounced keystroke and | ||
| # avoids the Search API's 30 req/min shared rate limit. |
There was a problem hiding this comment.
This is specific to one part of the frontend, I'd probably remove this from here, or move it to the api
| if not r.get("archived") and query_lower in r["full_name"].lower() | ||
| ) | ||
|
|
||
| # Query without accessible_only: existing search behavior |
| repos = [r for r in repos if query_lower in str(r["identifier"]).lower()] | ||
| return repos | ||
|
|
||
| # No query: fetch all accessible repos (without cache) |
| r | ||
| for r in all_repos_cached | ||
| if not r.get("archived") and query_lower in r["full_name"].lower() | ||
| ) |
There was a problem hiding this comment.
Ok, discussed on slack and I understand what's going on a bit better here now.
Yeah, I think it still makes sense to have use_cache be explicit. Then whenever is_accessible is true, we can optionally use the cache or not based on it, and just have assert not use_cache or not is_accessible to make sure that we don't confuse folks who use is_accessible.
Probably it'd be nice to get rid of is_accessible completely but idk if we're relying on the behaviour implicitly somewhere.
Add explicit use_cache parameter to get_repositories instead of implicitly tying caching to the accessible_only flag. This makes caching an independent concern that callers opt into explicitly.
Rename method and cache key to reflect that the cache is not specific to the accessible_only path. Remove implementation-detail comments from get_repositories.
| query: str | None = None, | ||
| page_number_limit: int | None = None, | ||
| accessible_only: bool = False, | ||
| use_cache: bool = False, |
There was a problem hiding this comment.
use_cache parameter is accepted but ignored in GitHub Enterprise integration
The use_cache parameter was added to the method signature for interface compliance but is never used in the implementation. When accessibleOnly=true is passed to the repos endpoint, the endpoint calls get_repositories(..., use_cache=True) (line 76 of organization_integration_repos.py). The GitHub integration correctly calls client.get_repos_cached() when use_cache=True, but the GitHub Enterprise integration silently ignores this flag and always calls get_client().get_repos() without caching. GitHub Enterprise users will not receive the caching benefit this PR is intended to provide.
Verification
Read github_enterprise/integration.py lines 224-254 to confirm use_cache parameter is never referenced in method body. Read github/integration.py lines 355-376 to see correct implementation using use_cache. Read organization_integration_repos.py line 76 to confirm use_cache=accessible_only is passed to all integrations.
Identified by Warden sentry-backend-bugs · CHR-VP4
There was a problem hiding this comment.
Fix attempt detected (commit 3bf81a4)
The use_cache parameter was added to the GitHub Enterprise get_repositories signature for interface compliance, but the method implementation still ignores it and unconditionally calls get_client().get_repos() without any conditional caching logic, identical to the before state in the critical execution paths.
The original issue appears unresolved. Please review and try again.
Evaluated by Warden
There was a problem hiding this comment.
this change is siloed to github only, not GHE
Avoid caching the initial no-query load when accessibleOnly is set. Cache is only useful for debounced keystroke searches, not the first page load which should always return fresh data.
Summary
OrganizationIntegrationReposEndpoint(/integrations/{id}/repos/) lets the frontend search for GitHub repos available to a GitHub App installation. When called withaccessibleOnly=trueand a search query (as the SCM onboarding repo selector does on each debounced keystroke), the previous implementation fetched all installation-accessible repos from the GitHub API (up to 50 pages of 100 = 5,000 repos) on every request, then filtered with a Python list comprehensionsentry.cache.default_cache(Redis) for 5 minutes, and filter locally on subsequent requests — reducing each typed query from O(pages) GitHub API calls to zeroTest plan
get_repositoriestests pass (6/6)test_get_repositories_accessible_only_caches_reposverifies cache hit path skips/installation/repositoriescallsaccessibleOnlysearch returns instantly from cacheRefs VDY-68